DML Operations on Hive Tables

The INSERT ... VALUES, UPDATE, DELETE, and MERGE SQL statements are supported in Apache Hive 0.14 and later. The INSERT ... VALUES statement enables users to write data to Apache Hive from values provided in SQL statements. The UPDATE and DELETE statements enable users to modify and delete values already written to Hive. The MERGE statement streamlines UPDATE, DELETE, and change-data-capture operations by drawing on a coexisting table. All four statements support auto-commit: each statement is a separate transaction that is automatically committed after it executes.

The INSERT ... VALUES, UPDATE, and DELETE statements require the following property values in the hive-site.xml configuration file:

Configuration Property                Required Value

hive.enforce.bucketing                true
hive.exec.dynamic.partition.mode      nonstrict
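Besides editing hive-site.xml, these properties can also be set per session from the Hive CLI or Beeline before running the DML statements:

```sql
-- Session-level equivalents of the hive-site.xml entries above
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
```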

INSERT Statement

CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3,2))
CLUSTERED BY (age) INTO 2 BUCKETS STORED AS ORC;

INSERT INTO TABLE students VALUES
  ('fred flintstone', 35, 1.28),
  ('barney rubble', 32, 2.32);

CREATE TABLE pageviews (userid VARCHAR(64), link STRING, came_from STRING)
PARTITIONED BY (datestamp STRING) CLUSTERED BY (userid) INTO 256 BUCKETS
STORED AS ORC;

INSERT INTO TABLE pageviews PARTITION (datestamp = '2014-09-23') VALUES
  ('jsmith', 'mail.com', 'sports.com'),
  ('jdoe', 'mail.com', null);

INSERT INTO TABLE pageviews PARTITION (datestamp) VALUES
  ('tjohnson', 'sports.com', 'finance.com', '2014-09-23'),
  ('tlee', 'finance.com', null, '2014-09-21');

UPDATE Statement
Use the UPDATE statement to modify data already written to Apache Hive. Depending on the condition specified in the optional WHERE clause, an UPDATE statement may affect every row in a table. You must have both the SELECT and UPDATE privileges to use this statement.
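Under Hive's SQL-based authorization, the required privileges can be granted as follows; the user name is a placeholder:

```sql
-- Grant the privileges needed for UPDATE (user name is hypothetical)
GRANT SELECT, UPDATE ON TABLE students TO USER alice;
```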

There are several approaches you can follow to update a Hive table, such as:

  • Use a temporary Hive table to update the target table.
  • Set TBLPROPERTIES to enable ACID transactions on Hive tables.
  • Use HBase to update records and create a Hive external table to expose the HBase table data.
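As a sketch of the HBase-based approach: an HBase-backed external table lets you overwrite individual records by key, since an HBase put replaces the previous version, and Hive reads always see the latest values. The table name and column mapping below are assumptions for illustration:

```sql
-- Hypothetical example: expose an existing HBase table 'dept_hbase'
-- through Hive; the column mapping is an assumption for illustration.
CREATE EXTERNAL TABLE dept_hbase (dept_no INT, d_name STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:d_name")
TBLPROPERTIES ("hbase.table.name" = "dept_hbase");

-- Inserting a row with an existing key acts as an update in HBase
INSERT INTO TABLE dept_hbase VALUES (10, 'Finance');
```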

UPDATE tablename SET column = value [, column = value ...] [WHERE expression];

The UPDATE statement has the following limitations:

  • The expression in the WHERE clause must be an expression supported by a Hive SELECT clause.
  • Partition and bucket columns cannot be updated.
  • Query vectorization is automatically disabled for UPDATE statements. However, updated tables can still be queried using vectorization.
  • Subqueries are not allowed on the right side of the SET clause.

The following example demonstrates the correct usage of this statement:

UPDATE students SET name = null WHERE gpa <= 1.0;

DELETE Statement
Use the DELETE statement to delete data already written to Apache Hive.

DELETE FROM tablename [WHERE expression];

The DELETE statement has the following limitation: query vectorization is automatically disabled for the DELETE operation. However, tables with deleted data can still be queried using vectorization.

The following example demonstrates the correct usage of this statement:

DELETE FROM students WHERE gpa <= 1.0;

MERGE Statement
Use the MERGE statement to efficiently perform record-level INSERT, UPDATE, and DELETE operations within Hive tables. MERGE, which is based on ANSI-standard SQL, can be a key tool of Hadoop data management.

The following SQL statement is an example of valid MERGE usage:

MERGE INTO customer
USING (SELECT * FROM new_customer_stage) sub
ON sub.id = customer.id
WHEN MATCHED THEN UPDATE SET name = sub.name, state = sub.new_state
WHEN NOT MATCHED THEN INSERT VALUES (sub.id, sub.name, sub.state);
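MERGE also supports a WHEN MATCHED ... THEN DELETE branch, which is useful for change-data-capture feeds that flag rows for removal. The staging column op_type below is an assumption for illustration:

```sql
-- Hypothetical CDC variant: sub.op_type = 'D' marks rows to delete
MERGE INTO customer
USING (SELECT * FROM new_customer_stage) sub
ON sub.id = customer.id
WHEN MATCHED AND sub.op_type = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET name = sub.name, state = sub.new_state
WHEN NOT MATCHED THEN INSERT VALUES (sub.id, sub.name, sub.state);
```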

Method 1: Update via a staging table
Suppose you want to update column d_name of table dept with values taken from a staging table dept_stg. In databases that support joined updates this would be written as the statement below, but Hive does not accept this syntax directly:

update a from dept a, dept_stg b
set d_name = b.d_name
where a.dept_no = b.dept_no;

Instead, you can emulate it with staging and temporary tables:

-- target table
create table dept
(
    dept_no int,
    d_name  string,
    l_loc   string
);

-- staging and temporary tables with the same schema
create table dept_stg like dept;
create table dept_temp like dept;

-- load the staging table with the updated records, then build the
-- temporary table, taking the staged value where one exists
insert into dept_temp
select
  a.dept_no,
  coalesce(b.d_name, a.d_name) as d_name,
  a.l_loc
from dept a
left join dept_stg b
  on a.dept_no = b.dept_no;

-- replace the target table contents with the merged result
insert overwrite table dept select * from dept_temp;

Method 2: Hive ACID transactions
The minimum requisites to perform Hive CRUD operations using ACID are:
  • Hive version 0.14 or later.
  • The table must be stored as ORC with TBLPROPERTIES ('transactional'='true').
  • The table must be CLUSTERED BY into some number of buckets.

UPDATE tbl_name SET upd_column = new_value
WHERE upd_column = current_value;
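In addition to the table-level requirements, the client session typically needs the transaction manager and concurrency support enabled. Shown here as session-level settings; they can equally go in hive-site.xml:

```sql
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
```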

DROP TABLE IF EXISTS hello_acid;

CREATE TABLE hello_acid
(
  key int
, value int
)
PARTITIONED BY (load_date date)
CLUSTERED BY(key) INTO 3 BUCKETS
STORED AS ORC TBLPROPERTIES ('transactional'='true');

INSERT INTO hello_acid partition (load_date='2016-03-01') VALUES (1, 1);
INSERT INTO hello_acid partition (load_date='2016-03-02') VALUES (2, 2);
INSERT INTO hello_acid partition (load_date='2016-03-03') VALUES (3, 3);
SELECT * FROM hello_acid;

DELETE FROM hello_acid WHERE key = 2;
UPDATE hello_acid SET value = 10 WHERE key = 3;
SELECT * FROM hello_acid;

Explode in Hive
create table Product
(
id Int,
ProductName String,
ProductColorOptions ARRAY<STRING>
);

insert into product select 1,'Watches',array('Red','Green');
insert into product select 2,'Clothes',array('Blue','Green');
insert into product select 3,'Books',array('Blue','Green','Red');

select explode(ProductColorOptions) from product;

The explode() UDTF takes an array (or a map) as input and outputs the elements of the array (map) as separate rows. UDTFs can be used in the SELECT expression list and as part of a LATERAL VIEW.

SELECT p.id, p.productname, colors.colorselection FROM product p
LATERAL VIEW EXPLODE(p.ProductColorOptions) colors AS colorselection;
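A related UDTF, posexplode() (available in Hive 0.13 and later), additionally returns each element's position in the array, which is handy when the order of the values matters:

```sql
-- posexplode() returns the array index alongside each element
SELECT p.id, p.productname, colors.pos, colors.colorselection
FROM product p
LATERAL VIEW POSEXPLODE(p.ProductColorOptions) colors AS pos, colorselection;
```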

Apache Hive Fixed-Width File Loading Options
There are a couple of options available to load a fixed-width file into Hive tables:

  • Create a Hive external table using the SerDe 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' and load the target table.
  • Load the fixed-width file into a single-column table and use substr to extract the required fields.

Sample fixed-width data:
1652401101720123930621977TXNPUES
1916701011420132368183100OUNHQEX
1916601011420133714468136GHAKASC
1916701011420132632703881PAHFSAP
8049501112120122766389794XDZANTV
8050701112120124609266335BWWYEZL
8050901112120121092717420QJYPKVO
8049701112120123386366766SOQLCMU
1916601011420135905893739FYIWNKA

CREATE EXTERNAL TABLE fixed_w_table
(userid STRING,
sl STRING,
date_col string,
ID string,
key_col string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ("input.regex" = "(.{5})(.{2})(.{8})(.{10})(.{7}).*" )
LOCATION '/data/fw';

load data local inpath '/home/tsipl1038/fixdwidth.txt' into table fixed_w_table;

create table single_col_stg (id string);

load data local inpath '/home/tsipl1038/fixdwidth.txt' into table single_col_stg;

select substr(id,1,5) as userid, substr(id,6,2) as sl, substr(id,8,8) as date_col,
       substr(id,16,10) as id, substr(id,26,7) as key_col
from single_col_stg;
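To persist the parsed fields rather than re-running the substr calls on every query, the same select can feed an INSERT into a regular table. The target table name parsed_fixed_w below is a hypothetical name for illustration:

```sql
-- Hypothetical target table for the parsed fixed-width fields
CREATE TABLE parsed_fixed_w (userid STRING, sl STRING, date_col STRING,
                             id STRING, key_col STRING) STORED AS ORC;

INSERT INTO TABLE parsed_fixed_w
SELECT substr(id,1,5), substr(id,6,2), substr(id,8,8),
       substr(id,16,10), substr(id,26,7)
FROM single_col_stg;
```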
